사전준비

OpenAI 설정

In [62]:

코드

import openai
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

openai.api_key  = os.getenv('ENV_OPENAI_API_KEY')

데이터셋

In [63]:

코드

# !pip install palmerpenguins
import pandas as pd
from palmerpenguins import load_penguins

penguins_raw = load_penguins()
penguins_raw.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	male	2007
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	female	2007
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	female	2007
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN	2007
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	female	2007

결측값

결측값 현황

In [64]:

코드

def show_missing(df):
    """Return a Pandas dataframe describing the contents of a source dataframe including missing values."""
    
    variables = []
    dtypes = []
    count = []
    unique = []
    missing = []
    pc_missing = []
    
    for item in df.columns:
        variables.append(item)
        dtypes.append(df[item].dtype)
        count.append(len(df[item]))
        unique.append(len(df[item].unique()))
        missing.append(df[item].isna().sum())
        pc_missing.append(round((df[item].isna().sum() / len(df[item])) * 100, 2))

    output = pd.DataFrame({
        'variable': variables, 
        'dtype': dtypes,
        'count': count,
        'unique': unique,
        'missing': missing, 
        'pc_missing': pc_missing
    })    
        
    return output
# penguins.isna().sum()
show_missing(penguins_raw)

	variable	dtype	count	unique	missing	pc_missing
0	species	object	344	3	0	0.00
1	island	object	344	3	0	0.00
2	bill_length_mm	float64	344	165	2	0.58
3	bill_depth_mm	float64	344	81	2	0.58
4	flipper_length_mm	float64	344	56	2	0.58
5	body_mass_g	float64	344	95	2	0.58
6	sex	object	344	3	11	3.20
7	year	int64	344	3	0	0.00

결측값 제거

In [74]:

코드

# !pip install llama-index
# penguins = penguins_raw.dropna()

import pandas as pd
from llama_index.indices.struct_store import GPTPandasIndex

penguins_raw_idx = GPTPandasIndex(df=penguins_raw)
raw_query_engine = penguins_raw_idx.as_query_engine(verbose=True)
response = raw_query_engine.query("""remove NaN values from the dataframe""")
# response = query_engine.query("""What is the pairwise correlation of the float64 datatype columns""")
 
print(response.response)

> Pandas Instructions:
```

df.dropna()
```
> Pandas Output:        species     island  bill_length_mm  bill_depth_mm  flipper_length_mm   
0       Adelie  Torgersen            39.1           18.7              181.0  \
1       Adelie  Torgersen            39.5           17.4              186.0   
2       Adelie  Torgersen            40.3           18.0              195.0   
4       Adelie  Torgersen            36.7           19.3              193.0   
5       Adelie  Torgersen            39.3           20.6              190.0   
..         ...        ...             ...            ...                ...   
339  Chinstrap      Dream            55.8           19.8              207.0   
340  Chinstrap      Dream            43.5           18.1              202.0   
341  Chinstrap      Dream            49.6           18.2              193.0   
342  Chinstrap      Dream            50.8           19.0              210.0   
343  Chinstrap      Dream            50.2           18.7              198.0   

     body_mass_g     sex  year  
0         3750.0    male  2007  
1         3800.0  female  2007  
2         3250.0  female  2007  
4         3450.0  female  2007  
5         3650.0    male  2007  
..           ...     ...   ...  
339       4000.0    male  2009  
340       3400.0  female  2009  
341       3775.0    male  2009  
342       4100.0    male  2009  
343       3775.0  female  2009  

[333 rows x 8 columns]
       species     island  bill_length_mm  bill_depth_mm  flipper_length_mm   
0       Adelie  Torgersen            39.1           18.7              181.0  \
1       Adelie  Torgersen            39.5           17.4              186.0   
2       Adelie  Torgersen            40.3           18.0              195.0   
4       Adelie  Torgersen            36.7           19.3              193.0   
5       Adelie  Torgersen            39.3           20.6              190.0   
..         ...        ...             ...            ...                ...   
339  Chinstrap      Dream            55.8           19.8              207.0   
340  Chinstrap      Dream            43.5           18.1              202.0   
341  Chinstrap      Dream            49.6           18.2              193.0   
342  Chinstrap      Dream            50.8           19.0              210.0   
343  Chinstrap      Dream            50.2           18.7              198.0   

     body_mass_g     sex  year  
0         3750.0    male  2007  
1         3800.0  female  2007  
2         3250.0  female  2007  
4         3450.0  female  2007  
5         3650.0    male  2007  
..           ...     ...   ...  
339       4000.0    male  2009  
340       3400.0  female  2009  
341       3775.0    male  2009  
342       4100.0    male  2009  
343       3775.0  female  2009  

[333 rows x 8 columns]

챗GPT 데이터 연결

In [73]:

코드

penguins = penguins_raw.dropna()

penguins_idx = GPTPandasIndex(df=penguins)
query_engine = penguins_idx.as_query_engine(verbose=True)

#response = query_engine.query("""데이터셋에 있는 행 수와 열 수를 반환합니다.\n
#                                 응답은 다음 태그에 표시된 대로 키-값 객체여야 합니다.\n 
#                                 <{'행':..., '열':...}>""")

response = query_engine.query("""Return how many rows and how many columns are in the dataset.\n
                                 The response must be a key-value object as we show in the tags:\n 
                                 <{'rows':..., 'columns':...}>""")
 
print(response.response)

> Pandas Instructions:
```
{'rows': df.shape[0], 'columns': df.shape[1]}
```
> Pandas Output: {'rows': 333, 'columns': 8}
{'rows': 333, 'columns': 8}

데이터 살펴보기

In [67]:

코드

penguins.info()

<class 'pandas.core.frame.DataFrame'>
Index: 333 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
 7   year               333 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 23.4+ KB

상관관계

In [68]:

코드

response = query_engine.query("""list the float64 datatype columns and calculate the pairwise correlation of the float64 datatype columns""")
# response = query_engine.query("""What is the pairwise correlation of the float64 datatype columns""")
 
print(response.response)

> Pandas Instructions:
```

df.select_dtypes(include=['float64']).corr(method='pearson')
```
> Pandas Output:                    bill_length_mm  bill_depth_mm  flipper_length_mm   
bill_length_mm           1.000000      -0.228626           0.653096  \
bill_depth_mm           -0.228626       1.000000          -0.577792   
flipper_length_mm        0.653096      -0.577792           1.000000   
body_mass_g              0.589451      -0.472016           0.872979   

                   body_mass_g  
bill_length_mm        0.589451  
bill_depth_mm        -0.472016  
flipper_length_mm     0.872979  
body_mass_g           1.000000  
                   bill_length_mm  bill_depth_mm  flipper_length_mm   
bill_length_mm           1.000000      -0.228626           0.653096  \
bill_depth_mm           -0.228626       1.000000          -0.577792   
flipper_length_mm        0.653096      -0.577792           1.000000   
body_mass_g              0.589451      -0.472016           0.872979   

                   body_mass_g  
bill_length_mm        0.589451  
bill_depth_mm        -0.472016  
flipper_length_mm     0.872979  
body_mass_g           1.000000

요약통계량

In [69]:

코드

response = query_engine.query("""list the float64 datatype columns and calcuate the average values by species""")
 
print(response.response)

> Pandas Instructions:
```

df.groupby('species')[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].mean().reset_index()
```
> Pandas Output:      species  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
0     Adelie       38.823973      18.347260         190.102740  3706.164384
1  Chinstrap       48.833824      18.420588         195.823529  3733.088235
2     Gentoo       47.568067      14.996639         217.235294  5092.436975
     species  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
0     Adelie       38.823973      18.347260         190.102740  3706.164384
1  Chinstrap       48.833824      18.420588         195.823529  3733.088235
2     Gentoo       47.568067      14.996639         217.235294  5092.436975

시각화

산점도

In [70]:

코드

response = query_engine.query("""visualize two columns; body_mass_g and flipper_length_mm with scatterplot""")
 
print(response.response)

> Pandas Instructions:
```

df.plot.scatter(x='body_mass_g', y='flipper_length_mm')
```
> Pandas Output: Axes(0.125,0.11;0.775x0.77)
Axes(0.125,0.11;0.775x0.77)

히스토그램

In [71]:

코드

# import seaborn as sns
response = query_engine.query("""draw distribution plot for body_mass_g with transparency by species with legend""")
 
print(response.response)

> Pandas Instructions:
```

df.groupby('species')['body_mass_g'].plot.kde(legend=True, alpha=0.5)
```
> Pandas Output: species
Adelie       Axes(0.125,0.11;0.775x0.77)
Chinstrap    Axes(0.125,0.11;0.775x0.77)
Gentoo       Axes(0.125,0.11;0.775x0.77)
Name: body_mass_g, dtype: object
species
Adelie       Axes(0.125,0.11;0.775x0.77)
Chinstrap    Axes(0.125,0.11;0.775x0.77)
Gentoo       Axes(0.125,0.11;0.775x0.77)
Name: body_mass_g, dtype: object

모형

In [72]:

코드

# !pip install -U scikit-learn
from sklearn.linear_model import LogisticRegression
response = query_engine.query("""build a `sex` classification model using logistic regression step-by-step""")
 
print(response.response)

> Pandas Instructions:
```

eval('LogisticRegression().fit(df[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]], df["sex"]).predict(df[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]])')
```
> Pandas Output: There was an error running the output as Python code. Error message: name 'LogisticRegression' is not defined
There was an error running the output as Python code. Error message: name 'LogisticRegression' is not defined

Traceback (most recent call last):
  File "C:\Users\statkclee\anaconda3\lib\site-packages\llama_index\indices\struct_store\pandas_query.py", line 58, in default_output_processor
    raise e
  File "C:\Users\statkclee\anaconda3\lib\site-packages\llama_index\indices\struct_store\pandas_query.py", line 56, in default_output_processor
    return str(eval(module_end_str, {}, local_vars))
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <module>
NameError: name 'LogisticRegression' is not defined

llama-index 쥬피터 노트북

OpenAI 설정

데이터셋

결측값

결측값 현황

결측값 제거

챗GPT 데이터 연결

데이터 살펴보기

상관관계

요약통계량

시각화

산점도

히스토그램

모형